Air Passengers Data Set


Basic Concepts

Before we begin analysing such data, we first need to create a mathematical framework to work in. Fortunately, we do not need anything too complicated, and for a finite time-series of length \(N\), we model the time series as a sequence of \(N\) random variables, \(X_i\), with \(i = 1, 2, ..., N\).

Note that each individual \(X_i\) is a wholly separate random variable: we only ever have a single measurement from each random variable. In many cases we simplify this, but it is important to understand and appreciate that such simplifications are just that. Time series are difficult to analyse.

Before we get to any of that though, and before we try to build any kind of model for the data, we always start with visualising the data. Often, a simple plot of the data helps us pick out aspects to analyse and incorporate into the models. For time series, one of the first things to do is the time plot, a simple plot of the data over time.

For the passenger data, a few aspects stand out that are very common in time series. It is apparent that the numbers increase over time, and this systematic change in the data is called the trend. Often, approximating the trend as a linear function of time is adequate for many data sets.
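As a quick sketch of this idea, a linear trend can be fitted with lm() by regressing the series on time() (the variable names here are illustrative, not part of the workshop code):

```r
# Fit a straight-line trend: passengers as a linear function of time (in years).
data(AirPassengers)
AP.trend.lm <- lm(AirPassengers ~ time(AirPassengers))
coef(AP.trend.lm)

# Overlay the fitted trend line on the time plot.
plot(AirPassengers)
abline(AP.trend.lm, lty = 2)
```

The slope coefficient gives the average yearly increase in passengers implied by the straight-line approximation.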

Plotting

A famous example of a time-series is the monthly count of international airline passengers, as shown in the figure below. This is a simple univariate time-series, with measurements taken on a monthly basis over a number of years, each datum consisting of a single number - the number of passengers travelling via a commercial flight that month.

plot(AirPassengers)

Data

A repeating pattern in the data that occurs over the period of the data (in this case, each year), is called the seasonal variation, though a more general concept of ‘season’ is implied — it often will not coincide with the seasons of the calendar.

A slightly more generalised concept than the seasonality is that of cycles, repeating patterns in the data that do not correspond to the natural fixed periods of the model. None of these are apparent in the air passenger data, and accounting for them is beyond the scope of this introductory tutorial.

Finally, another important benefit of visualising the data is that it helps identify possible outliers and erroneous data.


Example Timeseries

In this workshop we will look at a number of different time-series, discussed here.

This data comes in a few different formats, and this workshop discusses methods for analysing this data in a common format.

Visualising a Time Series



Worked Exercise 1.1

Load the air passengers data into your workspace and investigate the structure of the ts object using str(). How is a ts object different from a standard vector in R? Plot it using the default plot() method.

Worked Exercise 1.2

Using the data supplied in the file Maine.dat and the function read.table(), load the Maine unemployment data into your workspace and repeat the tasks above.

Worked Exercise 1.3

Analyse the trend and seasonality for the air passenger data by using the aggregate() and cycle() functions. Create a boxplot for the data, segmenting the data by month.

Worked Exercise 1.4

Repeat the above analysis for Maine unemployment data.

Worked Exercise 1.5

Calculate the average monthly data for each of the above time series. Compare this to the actual monthly data and plot them together. What can we learn from this?

Worked Exercise 1.6

Using the window() function, calculate quantitative values for the seasonal effects above (for example, the ratio of the mean for a given month to the overall monthly mean).

Solutions

Worksheet Exercise 1.1

Load the air passengers data into your workspace and investigate the structure of the ts object using str(). How is a ts object different from a standard vector in R? Plot it using the default plot() method.

class(AP.ts);
[1] "ts"
str(AP.ts);
 Time-Series [1:144] from 1949 to 1961: 112 118 132 129 121 135 148 148 136 119 ...
start(AP.ts); end(AP.ts); frequency(AP.ts);
[1] 1949    1
[1] 1960   12
[1] 12
plot(AP.ts, ylab = "Air Passengers (\'000s)");

### Using ggplot2 looks better, but you have to work hard for the
### labels on the x-axis so I am leaving this out for now.

qplot(1:length(AP.ts), as.vector(AP.ts), geom = 'line', ylab = 'Air Passengers (\'000s)');

Worksheet Exercise 1.2

Using the data supplied in the file Maine.dat and the function read.table(), load the Maine unemployment data into your workspace and repeat the tasks above.

class(MA.month.ts);
[1] "ts"
str(MA.month.ts);
 Time-Series [1:128] from 1996 to 2007: 6.7 6.7 6.4 5.9 5.2 4.8 4.8 4 4.2 4.4 ...
start(MA.month.ts); end(MA.month.ts); frequency(MA.month.ts);
[1] 1996    1
[1] 2006    8
[1] 12
plot(MA.month.ts, ylab = "Unemployment data for the state of Maine");

### Using ggplot2 looks better, but you have to work hard for the
### labels on the x-axis so I am leaving this out for now.

qplot(1:length(MA.month.ts), as.vector(MA.month.ts), geom = 'line', ylab = 'Unemployment data for the state of Maine');

Worksheet Exercise 1.3

Analyse the trend and seasonality for the air passenger data by using the aggregate() and cycle() functions. Create a boxplot for the data, segmenting the data by month.

### We are going to aggregate over the years, and extract the cycles
AP.year.ts  <- aggregate(AP.ts);
AP.cycle.ts <- cycle(AP.ts);
### We are going to stack the two plots together
layout(1:2)
plot(AP.year.ts)
boxplot(AP.ts ~ AP.cycle.ts)

### Create a plot in ggplot2

#plot1 <- qplot(start(AP.year.ts)[1]:end(AP.year.ts)[1], as.vector(AP.year.ts), geom = 'line', xlab = 'Year', ylab = 'Yearly Aggregates')
#plot2 <- qplot(cycle, AP, data = data.frame(cycle = as.factor(AP.cycle.ts), AP = as.vector(AP.ts)), geom = 'boxplot', xlab = 'Month', ylab = 'Passengers');

#grid.arrange(plot1, plot2);

Worksheet Exercise 1.4

Repeat the above analysis for Maine unemployment data.

### We are going to aggregate over the years, and extract the cycles
MA.year.ts  <- aggregate(MA.month.ts);
MA.cycle.ts <- cycle(MA.month.ts);
### We are going to stack the two plots together
layout(1:2)
plot(MA.year.ts)
boxplot(MA.month.ts ~ MA.cycle.ts)

### Create a plot in ggplot2

#plot1 <- qplot(start(MA.year.ts)[1]:end(MA.year.ts)[1], as.vector(MA.year.ts), geom = 'line', xlab = 'Year', ylab = 'Yearly Aggregates')
#plot2 <- qplot(MA.cycle.ts, MA.month.ts, data = data.frame(cycle = as.factor(MA.cycle.ts), MA = as.vector(MA.month.ts)), geom = 'boxplot', xlab = 'Month', ylab = 'Passengers');


#grid.arrange(plot1, plot2);

Worksheet Exercise 1.5

Calculate the average monthly data for each of the above time series. Compare this to the actual monthly data and plot them together. What can we learn from this?

MA.year.ts        <- aggregate(MA.month.ts);
MA.annual.mean.ts <- MA.year.ts / 12;
layout(1:2)
plot(MA.month.ts,       ylab = "unemployed (%)")
plot(MA.annual.mean.ts, ylab = "unemployed (%)")

### We can also plot this in ggplot2
MA.month.vec       <- as.vector(MA.month.ts);
MA.annual.mean.vec <- as.vector(MA.annual.mean.ts);
qplot(1:length(MA.month.vec), MA.month.vec, geom = 'line', colour = I('red')) +
    geom_line(aes(x = -6 + (1:length(MA.annual.mean.vec)) * 12, y = MA.annual.mean.vec), colour = 'blue');

Worksheet Exercise 1.6

Using the window() function, calculate quantitative values for the above.

MA.02.ts <- window(MA.month.ts, start = c(1996, 2), freq = TRUE);
MA.08.ts <- window(MA.month.ts, start = c(1996, 8), freq = TRUE);

head(MA.02.ts); tail(MA.02.ts);
[1] 6.7 6.5 5.7 5.0 4.4 4.2
[1] 4.2 4.9 5.8 5.6 5.8 5.6
str(MA.02.ts);
 Time-Series [1:11] from 1996 to 2006: 6.7 6.5 5.7 5 4.4 4.2 4.9 5.8 5.6 5.8 ...
Feb.ratio <- mean(MA.02.ts) / mean(MA.month.ts);
Feb.ratio
[1] 1.222529
Aug.ratio <- mean(MA.08.ts) / mean(MA.month.ts);
Aug.ratio
[1] 0.8163732

Multivariate Time Series


In many cases, we will also be dealing with time series that have multiple values at some, many, or all of the points in time.

Often, these values will be related in some ways, and we will want to analyse those relationships also. In fact, one of the most efficient methods of prediction is to find leading indicators for the value or values you wish to predict — you can often use the current values of the leading indicators to make inference on future values of the related quantities.

The fact that this is one of the best methods in time series analysis says a lot about the difficulty of prediction (Yogi Berra, a US baseball player noted for his pithy statements, once said “Prediction is difficult, especially about the future”).

Worked Exercise 2.1

Load in the multivariate data from the file cbe.dat. Investigate the object type and some sample data to get an idea of how it is structured. The R functions head() and tail() will be of use for this.

Worked Exercise 2.2

Create time series objects for this data using ts(), and plot them beside each other. cbind() is useful for creating all the plots together.

Worked Exercise 2.3

Merge the electricity usage data with the airline passenger data using ts.intersect() and investigate any possible similarities between the two time series.

Worked Exercise 2.4

Using the cor() function, investigate the correlation between the two time series. How plausible is a causal effect in this case?

CBE Data

head(CBE.df)
  choc beer elec
1 1451 96.3 1497
2 2037 84.4 1463
3 2477 91.2 1648
4 2785 81.9 1595
5 2994 80.5 1777
6 2681 70.4 1824
tail(CBE.df)
    choc  beer  elec
391 7529 148.3 14002
392 8715 148.3 14338
393 8450 133.5 12867
394 9085 193.8 12761
395 8350 208.4 12449
396 7080 197.0 12658
str(CBE.df)
'data.frame':   396 obs. of  3 variables:
 $ choc: int  1451 2037 2477 2785 2994 2681 3098 2708 2517 2445 ...
 $ beer: num  96.3 84.4 91.2 81.9 80.5 70.4 74.8 75.9 86.3 98.7 ...
 $ elec: int  1497 1463 1648 1595 1777 1824 1994 1835 1787 1699 ...

Time Series Plots

beer.ts <- ts(CBE.df$beer, start = 1958, freq = 12);
choc.ts <- ts(CBE.df$choc, start = 1958, freq = 12);
elec.ts <- ts(CBE.df$elec, start = 1958, freq = 12);
plot(cbind(beer.ts, choc.ts, elec.ts));

Airline Passengers

elec.ts    <- ts(CBE.df$elec, start = 1958, freq = 12);
AP.elec.ts <- ts.intersect(AP.ts, elec.ts);
head(AP.elec.ts); tail(AP.elec.ts);
     AP.ts elec.ts
[1,]   340    1497
[2,]   318    1463
[3,]   362    1648
[4,]   348    1595
[5,]   363    1777
[6,]   435    1824
      AP.ts elec.ts
[31,]   622    2287
[32,]   606    2276
[33,]   508    2096
[34,]   461    2055
[35,]   390    2004
[36,]   432    1924
str(AP.elec.ts);
 Time-Series [1:36, 1:2] from 1958 to 1961: 340 318 362 348 363 435 491 505 404 359 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:2] "AP.ts" "elec.ts"
plot(AP.elec.ts);

Create a plot in ggplot2

qplot(Var1, value, data = melt(AP.elec.ts), geom = 'line', colour = Var2);

elec.ts     <- ts(CBE.df$elec, start = 1958, freq = 12);
AP.elec.ts  <- ts.intersect(AP.ts, elec.ts);
AP.elec.cor <- cor(AP.elec.ts);

str(AP.elec.cor);
 num [1:2, 1:2] 1 0.884 0.884 1
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:2] "AP.ts" "elec.ts"
  ..$ : chr [1:2] "AP.ts" "elec.ts"
### Show the scaled plot
qplot(Var1, value, data = melt(scale(AP.elec.ts)), geom = 'line', colour = Var2);

Time Series Decomposition


Many time series are dominated by trends or seasonal effects, and we can create fairly simple models based on these two components. The first of these, the additive decomposition model, is just the sum of these effects, with the residual component being treated as random:

\[ x_t = m_t + s_t + z_t, \]

where, at any given time \(t\), \(x_t\) is the observed value, \(m_t\) is trend, \(s_t\) is the seasonal component, and \(z_t\) is the error term.

It is worth noting that, in general, the error terms will be a correlated sequence of values, something we will account for and model later.

In other cases, we could have a situation where the seasonal effect increases as the trend increases, modeling the values as:

\[ x_t = m_t s_t + z_t. \]

Other options also exist, such as modeling the log of the observed values, which does cause some non-trivial modeling issues, such as biasing any predicted values for the time series.

Various methods are used for estimating the trend, such as taking a moving average of the values, which is a common approach.
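As a sketch of the moving-average approach, a centred 12-point moving average can be computed with filter(); for monthly data this is the same filter that decompose() applies internally to estimate the trend (the variable names here are illustrative):

```r
# Centred moving average over a full year: half weight on the two
# end months so the window spans exactly 12 months symmetrically.
data(AirPassengers)
ma.weights  <- c(0.5, rep(1, 11), 0.5) / 12
AP.trend.ma <- filter(AirPassengers, filter = ma.weights, sides = 2)

# Overlay the smoothed trend on the raw series.
plot(AirPassengers)
lines(AP.trend.ma, lty = 2)
```

Note that the first and last six values of the smoothed series are NA, since the centred window cannot be filled there.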

Worked Exercise 3.1

Using the decompose() function in R, look at the trend and the seasonal variation for the airline passenger data. The output of this function can be plotted directly; check the output visually. Does the output match your intuition about what you observed?

Worked Exercise 3.2

Repeat this process for the CBE dataset.

Worked Exercise 3.3

Try a multiplicative model for all of the above. decompose() allows the selection of this via the type parameter. Is the multiplicative model better? In either case, explain why this might be.

Worked Exercise 3.4

Repeat the above, but use the stl() R function instead of decompose(). Compare the output of the two.

Solutions

Worksheet Exercise 3.1

Using the decompose() function in R, look at the trend and the seasonal variation for the airline passenger data. The output of this function can be plotted directly; check the output visually. Does the output match your intuition about what you observed?

AP.ts.decomp <- decompose(AP.ts);

str(AP.ts.decomp);
List of 6
 $ x       : Time-Series [1:144] from 1949 to 1961: 112 118 132 129 121 135 148 148 136 119 ...
 $ seasonal: Time-Series [1:144] from 1949 to 1961: -24.75 -36.19 -2.24 -8.04 -4.51 ...
 $ trend   : Time-Series [1:144] from 1949 to 1961: NA NA NA NA NA ...
 $ random  : Time-Series [1:144] from 1949 to 1961: NA NA NA NA NA ...
 $ figure  : num [1:12] -24.75 -36.19 -2.24 -8.04 -4.51 ...
 $ type    : chr "additive"
 - attr(*, "class")= chr "decomposed.ts"
plot(AP.ts.decomp);

Worksheet Exercise 3.2

Repeat this process for the CBE dataset.

beer.ts <- ts(CBE.df$beer, start = 1958, freq = 12);
choc.ts <- ts(CBE.df$choc, start = 1958, freq = 12);
elec.ts <- ts(CBE.df$elec, start = 1958, freq = 12);
beer.ts.decomp <- decompose(beer.ts);
choc.ts.decomp <- decompose(choc.ts);
elec.ts.decomp <- decompose(elec.ts);
cbe.ts <- cbind(beer.ts, choc.ts, elec.ts);

plot(cbe.ts);

plot(beer.ts.decomp);

plot(choc.ts.decomp);

plot(elec.ts.decomp);

Worksheet Exercise 3.3

Try a multiplicative model for all of the above. decompose() allows the selection of this via the type parameter. Is the multiplicative model better? In either case, explain why this might be.

AP.ts.mult.decomp <- decompose(AP.ts, type = 'multiplicative');

plot(AP.ts.mult.decomp);

Worksheet Exercise 3.4

Repeat the above, but use the stl() R function instead of decompose(). Compare the output of the two.

AP.ts.stl <- stl(AP.ts, s.window = 'periodic');

str(AP.ts.stl);
List of 8
 $ time.series: Time-Series [1:144, 1:3] from 1949 to 1961: -25.5 -35.22 -3.03 -8.3 -5.74 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:3] "seasonal" "trend" "remainder"
 $ weights    : num [1:144] 1 1 1 1 1 1 1 1 1 1 ...
 $ call       : language stl(x = AP.ts, s.window = "periodic")
 $ win        : Named num [1:3] 1441 19 13
  ..- attr(*, "names")= chr [1:3] "s" "t" "l"
 $ deg        : Named int [1:3] 0 1 1
  ..- attr(*, "names")= chr [1:3] "s" "t" "l"
 $ jump       : Named num [1:3] 145 2 2
  ..- attr(*, "names")= chr [1:3] "s" "t" "l"
 $ inner      : int 2
 $ outer      : int 0
 - attr(*, "class")= chr "stl"
plot(AP.ts.stl);

### To compare the two, I give an example of comparing the two calculated trend
plot.df <- melt(cbind(decomp = AP.ts.decomp$trend,  stl = AP.ts.stl$time.series[, 'trend']));

head(plot.df);
  Var1   Var2 value
1    1 decomp    NA
2    2 decomp    NA
3    3 decomp    NA
4    4 decomp    NA
5    5 decomp    NA
6    6 decomp    NA
qplot(Var1, value, data = plot.df, colour = Var2, geom = 'line', ylim = c(0, 500));

Autocorrelation


Assuming we can remove the trend and the seasonal variation, that still leaves the random component, \(z_t\). Unfortunately, analysing this is usually highly non-trivial. As discussed, we model the random component as a sequence of random variables, but no further assumptions are made.

To simplify the analysis, we often make assumptions such as treating the \(z_t\) as independent, identically distributed random variables, but this will rarely work well. Most of the time, the \(z_t\) are correlated.

The expected value or expectation of a random variable \(x\), denoted \(E(x)\), is the mean value of \(x\) in the population. So, for a continuous \(x\), we have

\[ \mu = E(x) = \int p(x) \, x \, dx. \]

and the variance, \(\sigma^2\), is the expectation of the squared deviations,

\[ \sigma^2 = E[(x - \mu)^2], \]

For bivariate data, each datapoint can be represented as \((x, y)\) and we can generalise this concept to the covariance, \(\gamma(x, y)\),

\[\begin{equation} \gamma(x, y) = E[(x - \mu_x)(y - \mu_y)]. \end{equation}\]

Correlation, \(\rho\), is the standardised covariance, dividing the covariance by the standard deviation of the two variables,

\[\begin{equation} \rho(x, y) = \frac{\gamma(x, y)}{\sigma_x \sigma_y}. \end{equation}\]
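These definitions are exactly what R's cov() and cor() compute (both use the sample versions with an \(n - 1\) divisor), so the identity can be checked directly on a small made-up sample:

```r
# Check that correlation is just covariance scaled by the two
# standard deviations, using a small illustrative bivariate sample.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

rho <- cov(x, y) / (sd(x) * sd(y))
all.equal(rho, cor(x, y))  # TRUE
```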

The mean function of a time series model is

\[\begin{equation} \mu(t) = E(x_t), \end{equation}\]

with the expectation in this case being across the ensemble of possible time series that might have been produced by this model. Of course, in many cases, we only have one realisation of the model, and so, without any further assumptions, we estimate the mean to be the measured value.

If the mean function is constant, we say that the time-series is stationary in the mean, and the estimate of the population mean is just the sample mean,

\[\begin{equation} \mu = \frac{1}{n} \sum^n_{t=1} x_t. \end{equation}\]

The variance function of a time-series model that is stationary in the mean is given by

\[\begin{equation} \sigma^2(t) = E[(x_t - \mu)^2], \end{equation}\]

and if we make the further assumption that the time-series is also stationary in the variance, then the population variance is just the sample variance

\[\begin{equation} \text{Var}(x) = \frac{\sum(x_t - \mu)^2}{n - 1} \end{equation}\]

Autocorrelation, often referred to as serial correlation, is the correlation between the random variables at different time intervals. We can define the autocovariance function and the autocorrelation function as functions of the lag, \(k\), as

\[\begin{eqnarray} \gamma_k &=& E[(x_t - \mu)(x_{t+k} - \mu)], \\ \rho_k &=& \frac{\gamma_k}{\sigma^2}. \end{eqnarray}\]

By default, the acf() function plots the correlogram, which is a plot of the sample autocorrelation \(r_k\) against the lag \(k\).

Exercise 4.1

Using the function acf(), calculate the autocorrelations for all the time series we have looked at. Look at the structure of the output, and use the help system to see what options are provided.

Exercise 4.2

Check the output of acf() against manual calculations of the correlations at various timesteps. Do the numbers match?

The cor() function and some vector indexing will be helpful here.

Exercise 4.3

Plot the output of acf() for the different time series. Think about what these plots are telling you. Do these plots help the modelling process? If so, how?

Exercise 4.4

Decompose the air passenger data and look at the appropriate correlogram. What does this plot tell you? How does it differ from the previous correlogram you looked at?

How can we use all that we have learned so far to assess the efficacy of the decompositional approach for time series?

Solutions

Worksheet Exercise 4.1

AP.ts.acf <- acf(AP.ts, plot = FALSE);

str(AP.ts.acf);
List of 6
 $ acf   : num [1:22, 1, 1] 1 0.948 0.876 0.807 0.753 ...
 $ type  : chr "correlation"
 $ n.used: int 144
 $ lag   : num [1:22, 1, 1] 0 0.0833 0.1667 0.25 0.3333 ...
 $ series: chr "AP.ts"
 $ snames: NULL
 - attr(*, "class")= chr "acf"

Worksheet Exercise 4.2

AP.ts.acf <- acf(AP.ts, plot = FALSE);


calc.autocor <- function(x, lag) {
    N <- length(x);

    idx1 <- 1:(N-lag);
    idx2 <- (lag+1):N;

    ### Correlate the series against itself, shifted by 'lag' steps
    cor(x[idx1], x[idx2]);
}
### Adding 1 to the index of the acf as acf[1] is the lag at 0 (i.e. it has value 1)
lag <- 1;
AP.ts.acf$acf[lag+1]
[1] 0.9480473
calc.autocor(AP.ts, lag);
[1] 0.9601946
lag <- 3;
AP.ts.acf$acf[lag+1]
[1] 0.8066812
calc.autocor(AP.ts, lag);
[1] 0.8373948
lag <- 12;
AP.ts.acf$acf[lag+1]
[1] 0.760395
calc.autocor(AP.ts, lag);
[1] 0.9905274

Worksheet Exercise 4.3

AP.ts.acf <- acf(AP.ts, plot = FALSE);

plot(AP.ts.acf, ylim = c(-1, 1));

Worksheet Exercise 4.4

AP.ts.acf <- acf(AP.ts, plot = FALSE);
AP.ts.decomp        <- decompose(AP.ts);
AP.ts.decomp.random <- AP.ts.decomp$random[!is.na(AP.ts.decomp$random)]
AP.ts.decomp.random.acf <- acf(AP.ts.decomp.random, plot = FALSE);
layout(1:2);
plot(AP.ts.acf,               ylim = c(-1, 1));
plot(AP.ts.decomp.random.acf, ylim = c(-1, 1))

Basic Forecasting


As mentioned earlier, an efficient way to forecast a variable is to find a related variable whose value leads it by one or more timesteps. The closer the relationship and the longer the lead time, the more useful the leading variable becomes.

The trick, of course, is to find a leading variable.

Multivariate series have a temporal equivalent to correlation and covariance, known as the cross-covariance function (ccvf) and the cross-correlation function (ccf),

\[\begin{eqnarray} \gamma_k(x, y) &=& E[(x_{t+k} - \mu_x)(y_t - \mu_y)], \\ \rho_k(x, y) &=& \frac{\gamma_k(x, y)}{\sigma_x \sigma_y}. \end{eqnarray}\]

Note that the above functions are not symmetric, as the lag is always on the first variable, \(x\).
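A small sketch with made-up data (not from the workshop files) shows both the lag convention and the asymmetry. In R, the lag \(k\) value of ccf(x, y) estimates the correlation between x[t+k] and y[t]:

```r
# Build a pair where y leads x by two steps: x is essentially y
# delayed by 2 timesteps, plus a little noise.
set.seed(7)
y <- rnorm(200)
x <- c(0, 0, y[1:198]) + rnorm(200, sd = 0.1)

xy.ccf <- ccf(x, y, plot = FALSE)
xy.ccf$lag[which.max(xy.ccf$acf)]  # peak at lag 2, since x[t+2] ~ y[t]

# Swapping the arguments mirrors the peak to lag -2.
yx.ccf <- ccf(y, x, plot = FALSE)
yx.ccf$lag[which.max(yx.ccf$acf)]
```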

Worksheet Exercise 5.1

Load the building approvals and activity data from the ApprovActiv.dat file. The data is quarterly and starts in 1996. Determine which is the leading variable and investigate the relationship between the two.

Worksheet Exercise 5.2

Binding the time-series using ts.union(), find the cross-correlations for the building data.

Is the relationship symmetric, and why?

Worksheet Exercise 5.3

Examine the cross-correlations of the random element of the decomposed time-series for the building data, and compare this to the original cross-correlations.

Exponentially-weighted moving average (EWMA)

Our main objective in forecasting is to estimate the value of a future quantity, \(x_{n+k}\), given past values \({x_1, x_2, ..., x_n}\). We assume no seasonal or trend effects, or any such effects have been removed from the data. We assume that the underlying mean of the data is \(\mu_t\), and that this value changes from timestep to timestep, but this change is random.

Our model can be expressed as

\[\begin{equation} x_t = \mu_t + w_t, \end{equation}\]

where \(\mu_t\) is the non-stationary mean of the process at time \(t\) and \(w_t\) are independent random variates with mean \(0\) and standard deviation \(\sigma\). We let \(a_t\) be our estimate of \(\mu_t\), and define the exponentially-weighted moving average (EWMA), \(a_t\), to be

\[\begin{equation} a_t = \alpha x_t + (1 - \alpha) a_{t-1}, \;\;\; 0 \leq \alpha \leq 1. \end{equation}\]

The value of \(\alpha\) controls the amount of smoothing, and is referred to as the smoothing parameter.
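The recursion is simple enough to implement directly. This sketch (a toy function, not part of the workshop code) matches the level-only update performed by HoltWinters() when beta and gamma are disabled, up to the choice of starting value:

```r
# Exponentially-weighted moving average:
#   a_t = alpha * x_t + (1 - alpha) * a_{t-1}, initialised at a_1 = x_1.
ewma <- function(x, alpha) {
    a <- numeric(length(x))
    a[1] <- x[1]
    for (t in 2:length(x)) {
        a[t] <- alpha * x[t] + (1 - alpha) * a[t - 1]
    }
    a
}

ewma(c(10, 12, 11, 13), alpha = 0.5)  # 10 11 11 12
```

A small alpha gives a smooth, slowly-adapting estimate; alpha near 1 tracks the most recent observations closely.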

Worksheet Exercise 5.4

Load the data in the motororg.dat file. This is a count of complaints received on a monthly basis by a motoring organisation from 1996 to 1999. Create an appropriate time series from this data. Plot the data, checking it for trends or seasonality.

Worksheet Exercise 5.5

Using the function HoltWinters(), with the additional parameters set to zero, create the EWMA of the data, allowing the function itself to choose the optimal value of \(\alpha\). Investigate and visualise the output, comparing it to the original time series.

Worksheet Exercise 5.6

Specifying values of \(\alpha\) of 0.2 and 0.9, create new versions of the EWMA and compare them with previous fits of the EWMA.

The Holt-Winters Method

The Holt-Winters method generalises this concept, allowing for trends and seasonal effects. The equations that govern this model for seasonal period, \(p\), are given by

\[\begin{eqnarray} a_t &=& \alpha (x_t - s_{t-p}) + (1 - \alpha)(a_{t-1} + b_{t-1}), \nonumber \\ b_t &=& \beta (a_t - a_{t-1}) + (1 - \beta)b_{t-1},\\ s_t &=& \gamma (x_t - a_t) + (1 - \gamma) s_{t-p}, \nonumber \end{eqnarray}\]

where \(a_t\), \(b_t\), \(s_t\) are the estimated level, slope and seasonal effect at time \(t\), and \(\alpha\), \(\beta\) and \(\gamma\) are the smoothing parameters.

Worksheet Exercise 5.7

Fit the Holt-Winters parameters to the air passenger data and check the fit. Visualise the raw time-series against the fitted data.

Worksheet Exercise 5.8

Predict data ahead for four years and visualise this data. How reliable do you think these forecasts are?

Solutions

Worksheet Exercise 5.1

App.ts <- ts(AA.df$Approvals, start = c(1996,1), freq=4);
Act.ts <- ts(AA.df$Activity,  start = c(1996,1), freq=4);


ts.plot(App.ts, Act.ts, lty = c(1, 3), ylim = c(0, 16000));

Worksheet Exercise 5.2

AppAct.ts     <- ts.union(App.ts, Act.ts);
AppAct.ts.acf <- acf(AppAct.ts, plot = FALSE);


plot(AppAct.ts.acf);

Worksheet Exercise 5.3

App.ts.decomp <- decompose(App.ts);
App.rnd.ts    <- window(App.ts.decomp$random, start = c(1996, 3));
Act.ts.decomp <- decompose(Act.ts);
Act.rnd.ts    <- window(Act.ts.decomp$random, start = c(1996, 3));
AppAct.rnd.ts.acf <- acf(ts.union(App.rnd.ts, Act.rnd.ts), na.action = na.pass, plot = FALSE);
AppAct.rnd.ts.ccf <- ccf(App.rnd.ts, Act.rnd.ts, na.action = na.pass, plot = FALSE);
plot(AppAct.rnd.ts.acf);

plot(AppAct.rnd.ts.ccf);

Worksheet Exercise 5.4

Complaints.ts <- ts(motororg.df$complaints, start = c(1996, 1), freq = 12)
plot(Complaints.ts, xlab = "Time / months", ylab = "Complaints");

Complaints.ts.decomp <- decompose(Complaints.ts);
plot(Complaints.ts.decomp);

Worksheet Exercise 5.5

Complaints.hw1 <- HoltWinters(Complaints.ts, beta = FALSE, gamma = FALSE);
print(Complaints.hw1);
Holt-Winters exponential smoothing without trend and without seasonal component.

Call:
HoltWinters(x = Complaints.ts, beta = FALSE, gamma = FALSE)

Smoothing parameters:
 alpha: 0.1429622
 beta : FALSE
 gamma: FALSE

Coefficients:
      [,1]
a 17.70343
plot(Complaints.hw1);

Worksheet Exercise 5.6

Complaints.hw2 <- HoltWinters(Complaints.ts, alpha = 0.2, beta = FALSE, gamma = FALSE);

print(Complaints.hw2);
Holt-Winters exponential smoothing without trend and without seasonal component.

Call:
HoltWinters(x = Complaints.ts, alpha = 0.2, beta = FALSE, gamma = FALSE)

Smoothing parameters:
 alpha: 0.2
 beta : FALSE
 gamma: FALSE

Coefficients:
      [,1]
a 17.97913
plot(Complaints.hw2);

Complaints.hw3 <- HoltWinters(Complaints.ts, alpha = 0.9, beta = FALSE, gamma = FALSE);

print(Complaints.hw3);
Holt-Winters exponential smoothing without trend and without seasonal component.

Call:
HoltWinters(x = Complaints.ts, alpha = 0.9, beta = FALSE, gamma = FALSE)

Smoothing parameters:
 alpha: 0.9
 beta : FALSE
 gamma: FALSE

Coefficients:
      [,1]
a 22.32544
plot(Complaints.hw3);

layout(1:3);
plot(Complaints.hw1);
plot(Complaints.hw2);
plot(Complaints.hw3);

Worksheet Exercise 5.7

AP.ts.add.hw <- HoltWinters(AP.ts, seasonal = "add");
plot(AP.ts.add.hw);

AP.ts.mult.hw <- HoltWinters(AP.ts, seasonal = "mult");
plot(AP.ts.mult.hw);

Worksheet Exercise 5.8

AP.ts.mult.hw.predict <- predict(AP.ts.mult.hw, n.ahead = 4 * 12);

ts.plot(AP.ts, AP.ts.mult.hw.predict, lty = 1:2);

ARMA and ARIMA Models


Now suppose we combine the ideas of autoregressive and moving average models together. A time series follows an autoregressive moving average (ARMA) process of order \((p, q)\) when

\[\begin{equation} x_t = \sum_{i=1}^p \alpha_i x_{t-i} + w_t + \sum_{j=1}^q \beta_j w_{t-j} \end{equation}\]

where \(w_t\) is white noise.

Both \(\text{AR}(p)\) and \(\text{MA}(q)\) models are special cases of \(\text{ARMA}(p, q)\) (with \(q = 0\) and \(p = 0\) respectively), and ARMA models are usually preferred due to their parsimony — when fitting data, the ARMA model is usually more parameter-efficient, requiring fewer parameters.
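As a sketch of the simulate-then-fit workflow used in the exercises below (with illustrative parameter values, deliberately different from those in the exercises):

```r
# Simulate 1000 points from an ARMA(1,1) process with ar = 0.4 and
# ma = -0.2, then fit an ARMA(1,1) model and inspect the estimates.
set.seed(42)
x.arma <- arima.sim(n = 1000, model = list(ar = 0.4, ma = -0.2))
x.fit  <- arima(x.arma, order = c(1, 0, 1))
coef(x.fit)  # ar1 and ma1 estimates should be close to 0.4 and -0.2
```

With a series this long, the fitted coefficients typically land within a few standard errors of the true values.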

Using the R function arima.sim(), create an ARMA(1,1) time-series of length 1000 with \(\alpha = -0.6\), and \(\beta = 0.5\). Plot this time series and check its ACF. Is this correct?

Using arima(), fit your generated time series to an \(\text{ARMA}(1,1)\) model and compare the fitted output to the values you have set.

Repeat the above exercises, but for an \(\text{ARMA}(2, 2)\) model with \(\alpha_1 = 0.2\), \(\alpha_2 = -0.5\), \(\beta_1 = -0.1\) and \(\beta_2 = 0.3\).

Investigate the effect of the parameters \(p\) and \(q\) by generating the various combinations of the \(\text{ARMA}(p, q)\) models but using the same set of innovations in each case. arima.sim() has a parameter innov = ... that allows you to pass in a set of innovations into the ARMA process.

Load in the GBP/NZD currency pair data from the file pound_nz.dat, and create a time-series from this. The data is quarterly, and starts in Q1 1991.

Fit the data GBP/NZD to an \(\text{MA}(1)\), an \(\text{AR}(1)\) and an \(\text{ARMA}(1, 1)\) process. Which of the above models does the best job at fitting the data?

It may be becoming quickly apparent that the choice of parameter count is non-trivial, and some kind of systematic approach would be desirable.

One such method for choosing the optimal number of parameters is to fit multiple options and choose the best one, using a metric known as the Akaike Information Criterion (AIC). The AIC is defined to be

\[\begin{equation} \text{AIC} = -2 \times \text{loglikelihood of fit} + 2 \times \text{parameter count}, \end{equation}\]

so that it rewards a better fit while penalising the use of too many parameters.
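A short sketch on simulated data (an AR(1) process, so we know the right answer; the series here is made up for illustration) shows the comparison:

```r
# Simulate an AR(1) series, fit three candidate models, compare AICs.
set.seed(1)
x.ar <- arima.sim(n = 500, model = list(ar = 0.7))

fit.ar1    <- arima(x.ar, order = c(1, 0, 0))
fit.ma1    <- arima(x.ar, order = c(0, 0, 1))
fit.arma11 <- arima(x.ar, order = c(1, 0, 1))

# Lower AIC is better; the misspecified MA(1) should score clearly worse.
sapply(list(ar1 = fit.ar1, ma1 = fit.ma1, arma11 = fit.arma11), AIC)
```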

Use the R function AIC() to calculate the AIC of the three models above. Which one is the best? Why is this question a trap?

Using all of the various techniques in this workshop, try to model the electricity time series data from the CBE dataset.

Development


Storage Modes

Nile

Let us look at the Nile data set, one of R’s built-in data sets.

Nile
Time Series:
Start = 1871 
End = 1970 
Frequency = 1 
  [1] 1120 1160  963 1210 1160 1160  813 1230 1370 1140  995  935 1110  994 1020
 [16]  960 1180  799  958 1140 1100 1210 1150 1250 1260 1220 1030 1100  774  840
 [31]  874  694  940  833  701  916  692 1020 1050  969  831  726  456  824  702
 [46] 1120 1100  832  764  821  768  845  864  862  698  845  744  796 1040  759
 [61]  781  865  845  944  984  897  822 1010  771  676  649  846  812  742  801
 [76] 1040  860  874  848  890  744  749  838 1050  918  986  797  923  975  815
 [91] 1020  906  901 1170  912  746  919  718  714  740
class(Nile)
[1] "ts"
str(Nile)
 Time-Series [1:100] from 1871 to 1970: 1120 1160 963 1210 1160 1160 813 1230 1370 1140 ...
mode(Nile)
[1] "numeric"
head(Nile)
[1] 1120 1160  963 1210 1160 1160

LA Rain

library(TSA)

dev.new(width=8, height=4)
data(larain)
plot(larain,ylab='Inches',xlab='Year',type='o')
zlag(larain,d=1)
  [1]    NA 20.86 17.41 18.65  5.53 10.74 14.14 40.29 10.53 16.72 16.02 20.82
 [13] 33.26 12.69 12.84 18.72 21.96  7.51 12.55 11.80 14.28  4.83  8.69 11.30
 [25] 11.96 13.12 14.77 11.88 19.19 21.46 15.30 13.74 23.92  4.89 17.85  9.78
 [37] 17.17 23.21 16.67 23.29  8.45 17.49  8.82 11.18 19.85 15.27  6.25  8.11
 [49]  8.94 18.56 18.63  8.69  8.32 13.02 18.93 10.72 18.76 14.67 14.49 18.24
 [61] 17.97 27.16 12.06 20.26 31.28  7.40 22.57 17.45 12.78 16.22  4.13  7.59
 [73] 10.63  7.38 14.33 24.95  4.08 13.69 11.89 13.62 13.24 17.49  6.23  9.57
 [85]  5.83 15.37 12.31  7.98 26.81 12.91 23.66  7.58 26.32 16.54  9.26  6.54
 [97] 17.45 16.69 10.70 11.01 14.97 30.57 17.00 26.33 10.92 14.41 34.04  8.90
[109]  8.92 18.00  9.11 11.57  4.56  6.49 15.07
dev.new(width=7, height=7)
plot(y=larain,x=zlag(larain,d=1),ylab='Inches',
xlab='Previous Year Inches')
cor(y=larain,x=zlag(larain,d=1), use="pairwise")
[1] -0.03308892
Color

dev.new(width=8, height=4)
data(color)
plot(color,ylab='Color Property',xlab='Batch',type='o')
zlag(color,d=1)
 [1] NA 67 63 76 66 69 71 72 71 72 72 83 87 76 79 74 81 76 77 68 68 74 68 69 75
[26] 80 81 86 86 79 78 77 77 80 76
dev.new(width=7, height=7)
plot(y=color,x=zlag(color,d=1),ylab='Color Property',
xlab='Previous Batch Color Property')
cor(y=color,x=zlag(color,d=1), use="pairwise")
[1] 0.554917

Hares

dev.new(width=8, height=4)
data(hare)
plot(hare,ylab='Abundance',xlab='Year',type='o')
zlag(hare,d=1)
 [1] NA 50 20 20 22 27 50 55 78 70 59 28 20 15 15 25 35 65 78 82 65 26 15 10  1
[26]  2  3 22 75 95 78
dev.new(width=7, height=7)
plot(y=hare,x=zlag(hare,d=1),ylab='Abundance',
xlab='Previous Year Abundance')
cor(y=hare,x=zlag(hare,d=1), use="pairwise")
[1] 0.7025777
dev.new(width=7, height=7)
plot(y=hare,x=zlag(hare,d=2),ylab='Abundance',
xlab='Abundance Two Years Ago')
cor(y=hare,x=zlag(hare,d=2), use="pairwise")

Temperature

dev.new(width=8, height=4)
data(tempdub)
plot(tempdub,ylab='Temperature',type='o')
dev.new(width=8, height=4)
plot(tempdub,ylab='Temperature',type='l')
points(y=tempdub,x=time(tempdub),
pch=as.vector(season(tempdub)), col=4, cex=0.8 )
aggregate(tempdub~season(tempdub), data=tempdub, mean)
   season(tempdub)  tempdub
1          January 16.60833
2         February 20.65000
3            March 32.47500
4            April 46.52500
5              May 58.09167
6             June 67.50000
7             July 71.71667
8           August 69.33333
9        September 61.02500
10         October 50.97500
11        November 36.65000
12        December 23.64167
dev.new(width=7, height=7)
boxplot(tempdub~season(tempdub), ylab="Temperature")
dev.new(width=7, height=7)
plot(y=tempdub,x=zlag(tempdub,d=1),ylab='Temperature',
xlab='Previous Month Temperature')
cor(y=tempdub,x=zlag(tempdub,d=1), use="pairwise")
[1] 0.8395499

Oil

dev.new(width=8, height=4)
data(oilfilters)
plot(oilfilters,ylab='Sales',type='o')
dev.new(width=8, height=4)
plot(oilfilters,ylab='Sales',type='l')
points(y=oilfilters,x=time(oilfilters),
pch=as.vector(season(oilfilters)), col=4, cex=0.8 )
aggregate(oilfilters~season(oilfilters), data=oilfilters, mean)
   season(oilfilters) oilfilters
1             January    5505.75
2            February    5361.00
3               March    2978.00
4               April    4591.50
5                 May    3705.50
6                June    3404.25
7                July    3102.00
8              August    2676.75
9           September    2271.75
10            October    2470.25
11           November    2178.25
12           December    2409.00
dev.new(width=7, height=7)
boxplot(oilfilters~season(oilfilters), ylab="Sales")
dev.new(width=7, height=7)
plot(y=oilfilters,x=zlag(oilfilters,d=1),ylab='Sales',
xlab='Previous Month Sales')
cor(y=oilfilters,x=zlag(oilfilters,d=1), use="pairwise")
[1] 0.3142145